
Guarantee missing stream promise delivery #12207


Merged
1 commit merged into grpc:master on Jul 17, 2025

Conversation

werkt
Contributor

@werkt werkt commented Jul 10, 2025

In observed cases, whether due to RST_STREAM or another failure from netty or the server, listeners can fail to be notified when a connection yields a null stream for the selected streamId. This causes hangs in clients, despite deadlines, with no obvious resolution.

This is not simply a race between netty delivering a result interpreted as a failure and the setSuccess previously implemented; the netty layer does not report the stream as failed.

Fixes #12185

@kannanjgithub
Contributor

Thanks for the PR. Can you fix the failing unit tests?

@werkt werkt force-pushed the null-stream-promise branch 2 times, most recently from 470738a to be78878 on July 11, 2025 14:02
@werkt
Contributor Author

werkt commented Jul 11, 2025

@kannanjgithub tests pass locally with modifications (not sure if they're suitable given the logic switch), but the checks don't seem to be rerunning.

@kannanjgithub kannanjgithub added the kokoro:run label Jul 14, 2025
@grpc-kokoro grpc-kokoro removed the kokoro:run label Jul 14, 2025
@werkt
Contributor Author

werkt commented Jul 14, 2025

@kannanjgithub the only failure in the Linux artifacts job for Kokoro seems to be a content issue with Apache's hosting, hit via curl (I retried the command locally and it passed).

@ejona86
Member

ejona86 commented Jul 15, 2025

It seems clear we need a unit test that triggers this, because we'd be very likely to break it again. I don't actually see what case is being missed in the current code. The comment seems to be talking about a case that already has a test covering it. So do things just need a slight additional tweak to trigger the breakage?

public void cancelBufferedStreamShouldChangeClientStreamStatus() throws Exception {
  // Force the stream to be buffered.
  receiveMaxConcurrentStreams(0);
  // Create a new stream with id 3.
  ChannelFuture createFuture = enqueue(
      newCreateStreamCommand(grpcHeaders, streamTransportState));
  assertEquals(STREAM_ID, streamTransportState.id());
  // Cancel the stream.
  cancelStream(Status.CANCELLED);
  assertTrue(createFuture.isSuccess());
  verify(streamListener).closed(eq(Status.CANCELLED), same(PROCESSED), any(Metadata.class));
}

This does seem like a good line of investigation. Note that I've only skimmed this as of yet, so I could be misunderstanding.

@werkt
Contributor Author

werkt commented Jul 15, 2025

@ejona86 I'm out of my depth here, so I apologize if this doesn't make any sense.

  • The handler that permits the stream to proceed to outboundWrites cannot locate its streamId on the connection.
  • The original implementation's comment attributes this to a specific case in which an RST_STREAM has occurred, but I don't see how to confirm that - I'm adding debugging to my current tracing to figure out whether, when we're stuck, there are any active streams, or whether the streamId ever existed.
  • The handler then absolves itself of responsibility, saying the connection should have delivered a CANCELLED to the listener, without asserting that this has actually taken place.
  • If this is the expected behavior of netty (that it always delivers a message to a listener whose registration I can't trace), then this sounds like a bug in netty, right?

Based on your ask and the observation of this situation, you're looking for a test which exhibits this lack of notification on the listener (where netty misses this delivery)? If so, doesn't there need to be a higher-level listener registered than just at the handler level (one that we assert was not called before delivering the promise failure)?

@werkt
Contributor Author

werkt commented Jul 15, 2025

The state of the connection looks suspect. This is debugging output for each of my outstanding calls that indicates we're stuck:

(13:00:53) WARNING: Still Incomplete: Write 1, 0 failsafes: buildfarm/uploads/320c6987-17fd-4557-8825-dd206be1ba68/blobs/blake3/b5faea3085e4ecce461dbc1e100c6e5696390e4ce01503f693c3e8e80c5d3469/4350, running for 429s, position 0 (last query: -1), queries 0/0, state: Uploading (waiting for ready), But we've never been ready, And onReady was never called, Tracer: CREATED|HANDLER_STREAM_CREATING|OPTIONAL_LABEL: stream=237, numActiveStreams=27, {17, 29, 33, 45, 47, 59, 103, 111, 117, 135, 153, 155, 163, 171, 175, 179, 181, 183, 185, 187, 189, 191, 193, 195, 197, 199, 201}, streamMayHaveExisted=true

For all of the stuck streams, the id is greater than the highest active stream. These active streams don't seem to be doing anything, but their count matches the number of stuck calls I have. All of them 'may have existed', for whatever that means.

Looking at DefaultHttp2Connection.java in netty, I don't see any concern for concurrent modifications of these sequence ids, streamMap, or activeStreams. Maybe I'm missing some serialization that is preventing any possible threaded interaction with connection().local(), but I don't understand otherwise how this is not far more prevalent.

@ejona86
Member

ejona86 commented Jul 15, 2025

I think the only case where future.isSuccess() && http2Stream == null can occur is one that is guaranteed to have called transportReportStatus(stopDelivery=true) before the listener is run. So it shouldn't end up mattering whether we do promise.setSuccess() or promise.setFailure().

CANCEL is unlikely to be the right status code. How about we use INTERNAL when failing the promise, with a message making it clear that this should never happen? Maybe: "Sending headers succeeded but there was no http2Stream. The stream should already be killed and this status will be discarded"

That guarantees that in the case of a bug the RPC will still become closed, and it means we'll likely see a bug report to investigate such a case if it happens, to figure out what went wrong. And it won't be a hang; hangs are horrible to debug.
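
A minimal sketch of this suggestion, not the actual grpc-java code: the class, method, and parameter names below are invented for illustration, and only the Status and Netty calls are real APIs.

import io.grpc.Status;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelPromise;
import io.netty.handler.codec.http2.Http2Connection;
import io.netty.handler.codec.http2.Http2Stream;

// Sketch only: when the HEADERS write succeeds but the connection has no stream for
// the chosen streamId, fail the creation promise with INTERNAL instead of calling
// setSuccess(), so the RPC is closed rather than left hanging.
final class HeadersWriteOutcome {
  static void complete(ChannelFuture headersWrite, Http2Connection connection,
      int streamId, ChannelPromise createPromise) {
    if (!headersWrite.isSuccess()) {
      createPromise.setFailure(headersWrite.cause()); // the write itself failed
      return;
    }
    Http2Stream http2Stream = connection.stream(streamId);
    if (http2Stream == null) {
      // The old code assumed a RST_STREAM had already reported CANCELLED and called
      // setSuccess(); failing the promise guarantees the RPC is closed even if that
      // assumption is wrong, and the status is discarded when it was right.
      createPromise.setFailure(Status.INTERNAL
          .withDescription("Sending headers succeeded but there was no http2Stream. "
              + "The stream should already be killed and this status will be discarded")
          .asRuntimeException());
      return;
    }
    createPromise.setSuccess(); // normal path: the stream exists
  }
}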

But that also means I don't think this fixes the issue you are hitting.


In the comment, "a stream buffered in the encoder" should be referring to Netty's StreamBufferingEncoder, which requires 1) the server to have set a MAX_CONCURRENT_STREAMS limit and 2) the client to exceed it. Could that be happening for you?
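
As a point of reference, a minimal sketch (assuming a grpc-java server; not from this PR) of one way condition 1) can arise: NettyServerBuilder.maxConcurrentCallsPerConnection() makes the server advertise MAX_CONCURRENT_STREAMS, and once a client has that many RPCs open on one connection, further HEADERS wait in the client's StreamBufferingEncoder. The port and limit below are arbitrary.

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

// Illustrative only: a server that advertises MAX_CONCURRENT_STREAMS = 100 per connection.
public final class LimitedServer {
  public static void main(String[] args) throws Exception {
    Server server = NettyServerBuilder.forPort(50051)   // arbitrary port
        .maxConcurrentCallsPerConnection(100)           // advertised stream limit
        .build();
    server.start();
    server.awaitTermination();
  }
}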

I'll note that transportReportStatus() is the normal way a listener is closed; the promise handling for writeHeaders() is a special case for when the RPC fails before Netty created the stream.

I think the RST_STREAM mentioned in that comment refers to this code (note that it calls transportReportStatus() before the RST_STREAM), as it is the only trigger of writeRstStream() in our code:

if (reason != null) {
  stream.transportReportStatus(reason, true, new Metadata());
}
if (!cmd.stream().isNonExistent()) {
  encoder().writeRstStream(ctx, stream.id(), Http2Error.CANCEL.code(), promise);
} else {
  promise.setSuccess();
}

The reason != null was added for the case where the server completed the call. I see three places that create CancelClientStreamCommand, all in NettyClientStream. Maybe one of them has a bug where reason == null, causing transportReportStatus() to not be called:

  • transportHeadersReceived(): this uses null for the reason, but is fine because transportTrailersReceived() calls transportReportStatus()
  • http2ProcessingFailed. Already calls transportReportStatus(), so reason doesn't matter, although it is known non-null because transportReportStatus() has a checkNotNull().
  • cancel(). Is called by AbstractClientStream.cancel(), which has a checkNotNull()

So no bug there.


Looking more at StreamBufferingEncoder, and the places it removes streams from pendingStreams:

  • writeRstStream(). That's the case we are handling already. There is a risk here that there is a RST_STREAM being sent from the client that we aren't aware of, but since the stream hasn't been created in Netty yet there's not a lot of code that could do that
  • close() will fail the promise
  • tryCreatePendingStreams() will fail the promise if an exception occurs.
  • cancelGoAwayStreams() will fail the promise

So no bug there.

@werkt
Contributor Author

werkt commented Jul 15, 2025

@ejona86 "hangs are horrible to debug" - this.
I have this hanging in front of me at will; see my previous comment about the current state of things. I can poke at nearly any layer of what's happening here. Is there anything you want me to do to get more information out of this, so we aren't just band-aiding whatever is really going wrong?

@ejona86
Member

ejona86 commented Jul 15, 2025

Looking at DefaultHttp2Connection.java in netty, I don't see any concern for concurrent modifications of these sequence ids, streamMap, or activeStreams. Maybe I'm missing some serialization that is preventing any possible threaded interaction with connection().local(), but I don't understand otherwise how this is not far more prevalent.

I still need to stare at the connection state more, but the model for Netty is that all the state changes happen on a single thread: the event loop. There are multiple threads ("event loop group"), but when a connection is created it chooses one and uses it for its lifetime. A single event loop can handle multiple connections.

So there's no need for synchronization. However, callback ordering can get pretty nasty; we have definitely seen problems in the past with callbacks being executed in a bad order, or with some call being made directly without popping up the stack first (e.g., reentrancy).
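
A tiny sketch of that model using Netty's public Channel/EventLoop API; the helper class and method here are invented for illustration and are not part of grpc-java or Netty.

import io.netty.channel.Channel;

// Every channel is bound to one event loop for its lifetime, so per-connection state is
// mutated only from that loop instead of being guarded by locks.
final class EventLoopHop {
  static void runOnChannelLoop(Channel channel, Runnable stateMutation) {
    if (channel.eventLoop().inEventLoop()) {
      stateMutation.run();                        // already on this channel's event loop
    } else {
      channel.eventLoop().execute(stateMutation); // hop onto it; tasks run one at a time
    }
  }
}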

@ejona86
Member

ejona86 commented Jul 15, 2025

I have this hanging in front of me at will

at will. Nice. Does this PR fix the hang?

@werkt
Contributor Author

werkt commented Jul 15, 2025

I have this hanging in front of me at will

at will. Nice. Does this PR fix the hang?

Yes, it does in all cases. I haven't proven that the expected null stream case ever occurs where we DO receive the CANCELLED (as implied/intended by the unchanged implementation), but this overreach in terms of exception delivery guarantees that we fall back to what becomes a retry (and eventually resolves).

@ejona86
Member

ejona86 commented Jul 15, 2025

I think I found it. I'll have to think a bit for how to fix it. This PR can definitely go in because it would prevent the RPCs from hanging forever, but I do think we should be using INTERNAL, because this was indeed a bug.

I realized I missed mentioning the case that the stream was created but then was killed before the callback was called. I considered it, but any further writes would have their callbacks called in appropriate order. And the RPC isn't known to the remote yet, so there shouldn't be any RST_STREAM.

But I missed GOAWAY and purposefully closing the stream, coupled with the write being buffered by Netty core I/O, not by StreamBufferingEncoder. If the stream is created but the receiving peer (the server) is a bit slower, then the HEADERS will be enqueued waiting to be sent.

There are two cases I see:

  • forcefulClose(). This calls transportReportStatus(), but only if the grpc transport state has been set on the Netty stream. But that happens in the callback in createStreamTraced() only after the headers have been sent. Note that this is only relevant when calling channel.shutdownNow()
  • goingAway(). The same as the last one; it would be unable to call transportReportStatus() if the headers haven't been written yet

Both of those "races" can only happen to the last few RPCs on a connection, as the HEADERS have to be buffered locally still.

In both of these cases, we have a proper error, but just aren't communicating it to the stream. We want to avoid making a status in createStreamTraced() because that method doesn't actually know the reason the RPC died.

@werkt
Contributor Author

werkt commented Jul 15, 2025

I think I found it. I'll have to think a bit for how to fix it. This PR can definitely go in because it would prevent the RPCs from hanging forever, but I do think we should be using INTERNAL, because this was indeed a bug.

Sounds good, offer still stands to add any debug logging/trace to my at will reproducer to solve this the right way.

Just to confirm: I change CANCELLED to INTERNAL and you're a green stamp? Do you want the exception creation outside of createStreamTraced() for this?

@ejona86
Member

ejona86 commented Jul 15, 2025

@werkt, this should fix what you are seeing. It isn't a full fix, because it doesn't work if the stream was buffered initially by StreamBufferingEncoder and then later by the I/O subsystem when the GOAWAY was received. But I doubt you are seeing that case.

diff --git a/netty/src/main/java/io/grpc/netty/NettyClientHandler.java b/netty/src/main/java/io/grpc/netty/NettyClientHandler.java
index a5fa0f800..b455180bb 100644
--- a/netty/src/main/java/io/grpc/netty/NettyClientHandler.java
+++ b/netty/src/main/java/io/grpc/netty/NettyClientHandler.java
@@ -768,6 +768,10 @@ class NettyClientHandler extends AbstractNettyHandler {
             }
           }
         });
+    Http2Stream http2Stream = connection().stream(streamId);
+    if (http2Stream != null) {
+      http2Stream.setProperty(streamKey, stream);
+    }
   }
 
   /**

That is being run immediately after the encoder().writeHeaders(), before the listener is likely executed.

@ejona86
Member

ejona86 commented Jul 15, 2025

I change CANCELLED to INTERNAL and you're a green stamp?

Yes. That is strictly better while also not ignoring the fact that there is a bug in the code. But you'll probably start noticing the failure. My small patch should fix that for you (but not in all cases). We can get both of those fixes in our next release (originally scheduled for last week, but now looking like this week).

Do you want the exception creation outside of createStreamTraced() for this?

Really, the normal "proper" thing is to call stream.transportReportStatus(status, RpcProgress.MISCARRIED, true, new Metadata()), similar to the error path already in that listener. At that point we could still do promise.setSuccess() and avoid the exception creation altogether.

But I'm quite willing to pay the cost of the exception creation if it avoids an RPC hang. This is apparently pretty rare for most people, too, because I think this bug has always existed in grpc-java. (Aren't you lucky, to find such a bug!)

I expect your client and server aren't in the same datacenter; you're going over a slower link of some sort, or just pushing a lot of bytes, or the server is under some CPU pressure. With that in mind, I can understand how many people wouldn't see this, because it requires a race of sending an RPC when receiving a GOAWAY while the TCP connection is fully buffered.

@ejona86 ejona86 added the TODO:backport label Jul 15, 2025
@werkt
Contributor Author

werkt commented Jul 15, 2025

@ejona86 Yahtzee. Your change alone also fixes the hang, preliminarily. I'm going to run my overnight test to confirm for sure (I set one up for the previous fix). Let me know what you want to do here; I'm happy to close this assuming it turns out the same.

Yes, this is definitely against a server with high latency (22ms) and limited connectivity (1Gbps from the client), under extreme concurrency (~100 streams/channel) with nearly full link saturation and CPU utilization. The server is well equipped for this, and traffic actually goes through nginx to reach Java again on the receive side.

ejona86 added a commit to ejona86/grpc-java that referenced this pull request Jul 15, 2025
In grpc#12185, RPCs were randomly hanging. In grpc#12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
@ejona86
Member

ejona86 commented Jul 15, 2025

I'd hope to still get this PR in. I'm doing my other half in #12222; this will be the back-up for when things still go wrong (as we know they still can).

In observed cases, whether RST_STREAM or another failure from netty or
the server, listeners can fail to be notified when a connection yields a
null stream for the selected streamId. This causes hangs in clients,
despite deadlines, with no obvious resolution.

Tests which relied upon this promise succeeding must now change.
@werkt werkt force-pushed the null-stream-promise branch from be78878 to 2a766c7 on July 16, 2025 00:01
@werkt
Contributor Author

werkt commented Jul 16, 2025

I'd hope to still get this PR in. I'm doing my other half in #12222; this will be the back-up for when things still go wrong (as we know they still can).

So be it. I changed CANCELLED to INTERNAL. Tests will still fail in this condition, so leaving them alone.

@ejona86 ejona86 added the kokoro:run label Jul 16, 2025
@grpc-kokoro grpc-kokoro removed the kokoro:run label Jul 16, 2025
@werkt
Contributor Author

werkt commented Jul 16, 2025

@ejona86 I'm unable to get the kokoro tests to run locally; does the test failure make sense to you?

@ejona86
Member

ejona86 commented Jul 17, 2025

They were just flakes. I restarted it earlier and it seems to be passing now. The Java 17 failure is also a flake.

@ejona86 ejona86 merged commit a37d3eb into grpc:master Jul 17, 2025
15 of 16 checks passed
ejona86 added a commit that referenced this pull request Jul 17, 2025
In #12185, RPCs were randomly hanging. In #12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
ejona86 added a commit to ejona86/grpc-java that referenced this pull request Jul 17, 2025
In grpc#12185, RPCs were randomly hanging. In grpc#12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
ejona86 added a commit that referenced this pull request Jul 18, 2025
In #12185, RPCs were randomly hanging. In #12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
Labels
TODO:backport PR needs to be backported. Removed after backport complete
Development

Successfully merging this pull request may close these issues.

blockingUnaryCall withDeadlineAfter RPC request hangs forever
4 participants